CUG XT SIG report
Chair Robert Ballance
Notes from Bob Ballance (SNL) and Charlie Carroll (Cray)
The group was evenly split between attendees interested in systems issues and those interested in applications issues, but voted to remain in a common caucus.
1. Report from XTrEme
Richard Alexander of CSCS outlined the goals and purpose of the XTrEme organization, a collection of large XT sites that is being set up to operate along the lines of SP XXL or CASC.
2. Open Issues -- what do users want from Cray, and what does Cray want from users?
- Asynchronous distributions
Users want Cray to document the dependencies between the software stack and the various async packages. All documentation should be placed on a web site, with data for all released versions.
Nick Cardo requested a table of releases and dependencies through the Cray support web pages.
Bob Ballance wanted to know when the scheduler will go async.
- Why not source code?
Makia Minnich of ORNL was the most vocal spokesperson for source access. PSC and others joined in. Charlie Carroll agreed that Cray is not doing a good job of supplying source. Cray has an active program under way to improve its build and source handling.
Sites doing InfiniBand and other kernel hacking (ORNL, CSCS, PSC) want source code.
Issues in providing source code include:
Build is a painful process
Everything is intertwingled
ALPS and other software may contain non-GPL code
Kernel
Lustre
MPI, MPI I/O
Charlie Carroll - ALPS is proprietary; anything GPL'd will be available
- What changes do sites want?
* Better diagnostics on HSN (PSC, but all sites agreed)
Katie Vargo of PSC wished for better diagnostics of HSN failures. Much agreement. Nick Cardo noted that the diags should be on-line.
* We use our production runs as diagnostics
* Need online diagnostics
* Need real-time diagnostics
* Add a 9 or more to reliability
* More reliability in DDN 8500 (SNL, CSCS)
Documentation preview for XT 2.0
Cray asked for feedback on the documentation preview for XT 2.0. The link given actually pointed to the XT 1.5 release. The link was fixed, but not in time for users to print and read the preview prior to CUG.
Software support and the SPR Process
Sites continue to request that Cray share SPRs and information. Too often, known problems arise at new sites, and the connections are left to the sites. Don Mengel explained the process, and the needs from the Cray side. The general consensus was that it is taking too long to find a working solution.
Lustre slowness (AWE)
How can we get more frequent updates to Lustre?
Release issues
Release dates seem to come very quickly, sometimes with little gain for sites. Once a release is available, the prior releases are not supported for as long as sites need. Nick Cardo requested that releases not appear in November, December, or January. Katie Vargo and others requested a minimum period of support based on release dates, e.g. either 1 year from GA of the release, or 6 months after GA of the next release. 1 month after GA is way too short!
From Charlie Carroll: Cray's stated policy is that they stop regular updates of major release N 30 days after major N+1 is released. Katie Vargo, who raised this issue, pointed out that PSC is considering going to v1.5 but doesn't like that releases for it are expected to end in October (only five months from now).
Two points:
- She'd like v1.5 support to not depend on the vagaries of the v2.0 release. For planning purposes, she'd like a statement such as: v1.5 will be supported for N months or years after its release.
- Thirty days is too short. (General agreement from the crowd.)
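To make the difference between these policies concrete, the arithmetic can be sketched as below. This is only an illustration: the GA dates are made up for the example, the function and key names are hypothetical, and "month" is assumed to mean a calendar month with the day clamped to the end of shorter months.

```python
import calendar
from datetime import date, timedelta


def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole calendar months, clamping the day
    to the last day of the target month (an assumption of this sketch)."""
    index = d.month - 1 + months
    year = d.year + index // 12
    month = index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def support_end_dates(ga: date, next_ga: date) -> dict:
    """End of regular updates for a release under each policy discussed.

    ga      -- GA date of the release in question (release N)
    next_ga -- GA date of the following major release (release N+1)
    """
    return {
        # Cray's stated policy: 30 days after N+1 is released.
        "stated_30_days_after_next_ga": next_ga + timedelta(days=30),
        # Site proposal: 1 year from GA of the release itself.
        "proposed_1_year_after_ga": add_months(ga, 12),
        # Site proposal: 6 months after GA of the next release.
        "proposed_6_months_after_next_ga": add_months(next_ga, 6),
    }


# Hypothetical dates, chosen only to exercise the arithmetic.
example = support_end_dates(ga=date(2006, 1, 15), next_ga=date(2006, 9, 1))
for policy, end in example.items():
    print(f"{policy}: {end.isoformat()}")
```

Only the second rule gives a date that is known on the day a release ships; the other two move with the schedule of the next release, which is the planning problem raised above.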
Applications
Mark Fahey of ORNL raised an issue with his application performance. Jim Harrell talked to Mark afterward and I think they worked on some ideas and approaches.
Schedulers
Cardo - The job launcher should be able to reject an application. For example, it could detect that a statically linked binary was built against older libraries, so that the job would fail unless the application is recompiled. He'd also like job/time limits on interactive jobs. Richard Alexander asked for memory usage reports after job runs.
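One way to picture the requested launcher check is the sketch below. It is not how any Cray launcher works: it assumes, purely for illustration, that each binary comes with a record of the library versions it was statically linked against and that the system publishes a minimum acceptable version for each library. The library labels and all function names are hypothetical.

```python
def parse_version(text):
    """Turn a dotted version string such as '1.5.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def stale_libraries(linked, minimum):
    """Return the libraries whose linked-in version is older than the
    minimum the system currently requires.

    linked  -- {library: version} recorded for the binary (hypothetical manifest)
    minimum -- {library: version} required by the running system (hypothetical)
    """
    stale = []
    for name, required in minimum.items():
        have = linked.get(name)
        if have is not None and parse_version(have) < parse_version(required):
            stale.append(name)
    return sorted(stale)


def admit(linked, minimum):
    """Decide at launch time whether to run the application or reject it
    with a message telling the user what to relink."""
    stale = stale_libraries(linked, minimum)
    if stale:
        return False, "rejected: relink against current " + ", ".join(stale)
    return True, "accepted"


# Hypothetical library labels and versions.
system_minimum = {"mpi": "1.5.0", "io": "2.0"}
old_binary = {"mpi": "1.4.2", "io": "2.0"}
new_binary = {"mpi": "1.5.0", "io": "2.1"}

print(admit(old_binary, system_minimum))
print(admit(new_binary, system_minimum))
```

Rejecting at launch tells the user what to rebuild before the job consumes its allocation, rather than leaving the failure to show up at run time.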
What can we do better next year?
Consensus building and rankings of issues.